Wine Quality Analysis Based on Its Chemical Content

This report analyzes in detail the chemical composition of white wine and its effect on quality.

Summary information on different variables

## [1] 4898   13
##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##     alcohol         quality     
##  Min.   : 8.00   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.40   Median :6.000  
##  Mean   :10.51   Mean   :5.878  
##  3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :14.20   Max.   :9.000
## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...

There are a total of 4898 wine samples.

Univariate Data Analysis

Histogram Analysis

## 
##    3    4    5    6    7    8    9 
##   20  163 1457 2198  880  175    5

Most wines in this data set have a quality score of 5, 6 or 7. No wine has a quality score below 3 or above 9.
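
The counts in the table above can be turned into shares directly. The following is a minimal sketch in R; the name `quality_counts` is an assumption of this example, and the counts are copied from the printed table rather than read from the data file.

```r
# Counts per quality score, copied from the frequency table above
quality_counts <- c(`3` = 20, `4` = 163, `5` = 1457, `6` = 2198,
                    `7` = 880, `8` = 175, `9` = 5)

total <- sum(quality_counts)  # number of wines

# Share of wines scoring 5, 6 or 7
share_5_to_7 <- sum(quality_counts[c("5", "6", "7")]) / total

# Mean quality reconstructed from the counts
scores <- as.integer(names(quality_counts))
mean_quality <- sum(scores * quality_counts) / total

print(round(share_5_to_7, 3))
print(round(mean_quality, 3))
```

Roughly 93% of the wines score 5, 6 or 7, and the reconstructed mean agrees with the mean quality of 5.878 reported in the summary.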

Fixed acidity, volatile acidity, citric acid, chlorides, pH, sulphates and density appear approximately normally distributed.

Residual sugar has a bimodal distribution: most wines cluster around two sugar values.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   23.00   34.00   35.31   46.00  289.00

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0   108.0   134.0   138.4   167.0   440.0

Free sulfur dioxide and total sulfur dioxide both have outliers. For free sulfur dioxide the median is 34 and the 3rd quartile is 46, but the maximum is 289. Similarly, for total sulfur dioxide the median is 134 and the 3rd quartile is 167, but the maximum is 440.
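
One common way to make "outlier" concrete is Tukey's 1.5 x IQR rule. The report does not say which criterion was used, so both the rule and the helper name `upper_fence` are assumptions of this sketch. The quartiles and maxima are copied from the two summaries above.

```r
# Upper Tukey fence: Q3 + k * IQR (k = 1.5 is the conventional choice)
upper_fence <- function(q1, q3, k = 1.5) q3 + k * (q3 - q1)

free_fence  <- upper_fence(q1 = 23,  q3 = 46)   # free sulfur dioxide
total_fence <- upper_fence(q1 = 108, q3 = 167)  # total sulfur dioxide

# Compare each reported maximum with its fence
free_max_is_outlier  <- 289 > free_fence
total_max_is_outlier <- 440 > total_fence

print(c(free = free_fence, total = total_fence))
print(c(free = free_max_is_outlier, total = total_max_is_outlier))
```

Under this rule the fences are 80.5 and 255.5, so both maxima (289 and 440) lie well beyond them.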

Alcohol seems to have a roughly uniform distribution.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1    1225    2450    2450    3674    4898

After the log transform, the variable X (the row index, running from 1 to 4898) has a left-skewed distribution.

What is the structure of the data set?

The data set has 4898 wines with 13 features (X, fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol, quality). The following observations were made about the data:

* Most wines have a quality score of 5, 6 or 7.
* No wine has a quality score below 3 or above 9.
* Alcohol is roughly uniformly distributed.
* X is negatively skewed after the log transform.
* Residual sugar is bimodal after transformation, and free and total sulfur dioxide have outliers.
* The remaining features (fixed acidity, volatile acidity, citric acid, chlorides, density, pH, sulphates) are approximately normally distributed.

What are the main features of interest?

Quality is the main feature of interest; I will investigate how the other features influence it. I suspect that residual sugar is related to the quality of the wine, and it also seems plausible that alcohol content is. From the univariate analysis alone it is very difficult to establish any relationship between quality and the other features.

What unusual patterns did you see in your univariate data analysis? Did you make any adjustments to tidy the data or transform its shape? If so, why did you do this?

With default settings the histogram of X has an unusual shape, so I applied a log transform, which gives a left-skewed (negatively skewed) distribution with most values falling around 2500. Residual sugar is right-skewed on its original scale (mean 6.391 versus median 5.2, with a maximum of 65.8); after transformation it shows a bimodal distribution, with most wines falling around 3 or 9. The bin settings were adjusted in all the histograms.

Bivariate Data Analysis

Scatterplot Analysis

##         X fixed.acidity volatile.acidity citric.acid residual.sugar
## 733   733           8.8            0.280        0.45            6.0
## 2252 2252           7.4            0.180        0.29            1.4
## 3169 3169           6.2            0.190        0.38            5.1
## 1543 1543           7.2            0.160        0.49            1.3
## 499   499           5.7            0.335        0.34            1.0
## 376   376           5.1            0.330        0.22            1.6
##      chlorides free.sulfur.dioxide total.sulfur.dioxide density   pH
## 733      0.022                  14                   49 0.99340 3.01
## 2252     0.042                  34                  101 0.99384 3.54
## 3169     0.019                  22                   82 0.98961 3.05
## 1543     0.037                  27                  104 0.99240 3.23
## 499      0.040                  13                  174 0.99200 3.27
## 376      0.027                  18                   89 0.98930 3.51
##      sulphates alcohol quality
## 733       0.33    11.1       7
## 2252      0.60    10.5       7
## 3169      0.36    12.5       6
## 1543      0.57    10.6       6
## 499       0.66    10.0       5
## 376       0.38    12.5       7

## 
## Two-Step Estimates
## 
## Correlations/Type of Correlation:
##                             X fixed.acidity volatile.acidity citric.acid
## X                           1       Pearson          Pearson     Pearson
## fixed.acidity         -0.2558             1          Pearson     Pearson
## volatile.acidity     0.002858       -0.0227                1     Pearson
## citric.acid           -0.1499        0.2892          -0.1495           1
## residual.sugar       0.006624       0.08902          0.06429     0.09421
## chlorides            -0.04565       0.02309          0.07051      0.1144
## free.sulfur.dioxide  -0.01193       -0.0494         -0.09701     0.09408
## total.sulfur.dioxide   -0.162       0.09107          0.08926      0.1211
## density                -0.186        0.2653          0.02711      0.1495
## pH                    -0.1158       -0.4259         -0.03192     -0.1637
## sulphates            0.009808      -0.01714         -0.03573     0.06233
## alcohol                0.2137       -0.1209          0.06772    -0.07573
## quality               0.03576       -0.1137          -0.1947   -0.009209
##                      residual.sugar chlorides free.sulfur.dioxide
## X                           Pearson   Pearson             Pearson
## fixed.acidity               Pearson   Pearson             Pearson
## volatile.acidity            Pearson   Pearson             Pearson
## citric.acid                 Pearson   Pearson             Pearson
## residual.sugar                    1   Pearson             Pearson
## chlorides                   0.08868         1             Pearson
## free.sulfur.dioxide          0.2991    0.1014                   1
## total.sulfur.dioxide         0.4014    0.1989              0.6155
## density                       0.839    0.2572              0.2942
## pH                          -0.1941  -0.09044          -0.0006178
## sulphates                  -0.02666   0.01676             0.05922
## alcohol                     -0.4506   -0.3602             -0.2501
## quality                    -0.09758   -0.2099            0.008158
##                      total.sulfur.dioxide  density      pH sulphates
## X                                 Pearson  Pearson Pearson   Pearson
## fixed.acidity                     Pearson  Pearson Pearson   Pearson
## volatile.acidity                  Pearson  Pearson Pearson   Pearson
## citric.acid                       Pearson  Pearson Pearson   Pearson
## residual.sugar                    Pearson  Pearson Pearson   Pearson
## chlorides                         Pearson  Pearson Pearson   Pearson
## free.sulfur.dioxide               Pearson  Pearson Pearson   Pearson
## total.sulfur.dioxide                    1  Pearson Pearson   Pearson
## density                            0.5299        1 Pearson   Pearson
## pH                               0.002321 -0.09359       1   Pearson
## sulphates                          0.1346  0.07449   0.156         1
## alcohol                           -0.4489  -0.7801  0.1214  -0.01743
## quality                           -0.1747  -0.3071 0.09943   0.05368
##                      alcohol quality
## X                    Pearson Pearson
## fixed.acidity        Pearson Pearson
## volatile.acidity     Pearson Pearson
## citric.acid          Pearson Pearson
## residual.sugar       Pearson Pearson
## chlorides            Pearson Pearson
## free.sulfur.dioxide  Pearson Pearson
## total.sulfur.dioxide Pearson Pearson
## density              Pearson Pearson
## pH                   Pearson Pearson
## sulphates            Pearson Pearson
## alcohol                    1 Pearson
## quality               0.4356       1
## 
## Standard Errors:
##                            X fixed.acidity volatile.acidity citric.acid
## X                                                                      
## fixed.acidity        0.01336                                           
## volatile.acidity     0.01429       0.01428                             
## citric.acid          0.01397        0.0131          0.01397            
## residual.sugar       0.01429       0.01418          0.01423     0.01416
## chlorides            0.01426       0.01428          0.01422      0.0141
## free.sulfur.dioxide  0.01429       0.01426          0.01416     0.01416
## total.sulfur.dioxide 0.01392       0.01417          0.01418     0.01408
## density               0.0138       0.01328          0.01428     0.01397
## pH                    0.0141        0.0117          0.01428     0.01391
## sulphates            0.01429       0.01429          0.01427     0.01423
## alcohol              0.01364       0.01408          0.01422     0.01421
## quality              0.01427       0.01411          0.01375     0.01429
##                      residual.sugar chlorides free.sulfur.dioxide
## X                                                                
## fixed.acidity                                                    
## volatile.acidity                                                 
## citric.acid                                                      
## residual.sugar                                                   
## chlorides                   0.01418                              
## free.sulfur.dioxide         0.01301   0.01414                    
## total.sulfur.dioxide        0.01199   0.01372            0.008878
## density                    0.004233   0.01334             0.01305
## pH                          0.01375   0.01417             0.01429
## sulphates                   0.01428   0.01429             0.01424
## alcohol                     0.01139   0.01244              0.0134
## quality                     0.01415   0.01366             0.01429
##                      total.sulfur.dioxide  density      pH sulphates
## X                                                                   
## fixed.acidity                                                       
## volatile.acidity                                                    
## citric.acid                                                         
## residual.sugar                                                      
## chlorides                                                           
## free.sulfur.dioxide                                                 
## total.sulfur.dioxide                                                
## density                           0.01028                           
## pH                                0.01429  0.01416                  
## sulphates                         0.01403  0.01421 0.01394          
## alcohol                           0.01141 0.005594 0.01408   0.01429
## quality                           0.01385  0.01294 0.01415   0.01425
##                      alcohol
## X                           
## fixed.acidity               
## volatile.acidity            
## citric.acid                 
## residual.sugar              
## chlorides                   
## free.sulfur.dioxide         
## total.sulfur.dioxide        
## density                     
## pH                          
## sulphates                   
## alcohol                     
## quality              0.01158
## 
## n = 4898 
## 
## P-values for Tests of Bivariate Normality:
##                               X fixed.acidity volatile.acidity citric.acid
## X                                                                         
## fixed.acidity        1.384e-135                                           
## volatile.acidity       4.43e-79     8.326e-51                             
## citric.acid          8.099e-177    7.094e-126        3.11e-162            
## residual.sugar       1.269e-153    3.961e-142       3.871e-146  6.704e-208
## chlorides                     0             0                0           0
## free.sulfur.dioxide   2.436e-59     9.489e-44        2.307e-50  1.481e-110
## total.sulfur.dioxide  4.165e-65     1.731e-38        3.649e-49  2.145e-108
## density              6.906e-101     2.053e-49        1.458e-45  1.894e-132
## pH                    2.823e-57     5.114e-36        2.379e-36  3.439e-101
## sulphates             1.308e-56     1.076e-33        4.068e-33  4.195e-103
## alcohol              3.053e-105     1.172e-74        1.458e-96  6.265e-186
## quality                       0             0                0           0
##                      residual.sugar chlorides free.sulfur.dioxide
## X                                                                
## fixed.acidity                                                    
## volatile.acidity                                                 
## citric.acid                                                      
## residual.sugar                                                   
## chlorides                         0                              
## free.sulfur.dioxide      2.279e-119         0                    
## total.sulfur.dioxide     9.659e-122         0           2.231e-30
## density                   3.89e-196         0           1.384e-52
## pH                       1.085e-119         0           3.012e-24
## sulphates                2.257e-116         0            1.06e-18
## alcohol                  3.624e-202         0           9.643e-71
## quality                           0         0                   0
##                      total.sulfur.dioxide    density        pH sulphates
## X                                                                       
## fixed.acidity                                                           
## volatile.acidity                                                        
## citric.acid                                                             
## residual.sugar                                                          
## chlorides                                                               
## free.sulfur.dioxide                                                     
## total.sulfur.dioxide                                                    
## density                         1.193e-28                               
## pH                              3.591e-17  1.448e-34                    
## sulphates                       6.053e-32  1.796e-35 1.473e-17          
## alcohol                         2.343e-57 3.223e-108 2.598e-62 3.961e-84
## quality                                 0          0         0         0
##                      alcohol
## X                           
## fixed.acidity               
## volatile.acidity            
## citric.acid                 
## residual.sugar              
## chlorides                   
## free.sulfur.dioxide         
## total.sulfur.dioxide        
## density                     
## pH                          
## sulphates                   
## alcohol                     
## quality                    0

The Pearson correlations between quality and the other features are quite low. Alcohol has the highest correlation with quality, at 0.4356. Other features that might influence quality are fixed acidity (-0.1137), volatile acidity (-0.1947), residual sugar (-0.09758), chlorides (-0.2099), total sulfur dioxide (-0.1747), density (-0.3071) and sulphates (0.05368).

## 
##  Pearson's product-moment correlation
## 
## data:  wine$alcohol and wine$quality
## t = 33.858, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4126015 0.4579941
## sample estimates:
##       cor 
## 0.4355747
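
The t statistic and confidence interval in the test output above follow from the correlation and the sample size alone. The sketch below recomputes them by hand, using the standard t formula for a Pearson correlation and the Fisher z-transform for the interval; the variable names are assumptions of this example, and the inputs are copied from the output.

```r
r  <- 0.4355747  # sample correlation from the output above
df <- 4896       # degrees of freedom, n - 2
n  <- df + 2

# t statistic for the null hypothesis of zero correlation
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# 95% confidence interval via the Fisher z-transform
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

print(round(t_stat, 3))
print(round(ci, 4))
```

With a sample this large the standard error on the z scale is small (about 0.014), which is why the interval is narrow even though the correlation itself is only moderate.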

From the scatter plot of alcohol against quality, wines of quality 5, 6 and 7 span alcohol contents from 8 percent to 13 percent. Lower-quality wines, of quality 5 and below, have alcohol content predominantly between 8 and 11 percent.

## 
##  Pearson's product-moment correlation
## 
## data:  wine$alcohol and wine$density
## t = -87.255, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.7908646 -0.7689315
## sample estimates:
##        cor 
## -0.7801376

Density and alcohol are negatively correlated, with a Pearson correlation of -0.7801: wines with higher alcohol content have lower density.

## 
##  Pearson's product-moment correlation
## 
## data:  wine$residual.sugar and wine$density
## t = 107.87, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8304732 0.8470698
## sample estimates:
##       cor 
## 0.8389665

Residual sugar and density are positively correlated, with a correlation of 0.839.

As expected, the relationship between density and quality is the opposite of that between alcohol and quality. For instance, lower-quality wines have predominantly higher density, and higher-quality wines lower density.

It is difficult to establish any relationship between fixed acidity and quality. Volatile acidity has a lot of outliers, and even after removing them I cannot establish any relationship between quality and acidity. All quality levels have similar distributions of volatile acidity and fixed acidity.

Residual sugar has a lot of outliers, so only values up to the 99th percentile are included in the analysis. From the scatter plot we can infer that wines of quality 4 or below have predominantly lower residual sugar, while wines of quality 5 and above have residual sugar distributions similar to one another. This is a surprise: I was expecting a pattern similar to density versus quality, but it was closer to alcohol versus quality.
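
Trimming at the 99th percentile can be sketched as follows. The report does not show the code it used, so this is only one plausible reading; the vector `sugar` is hypothetical illustration data (99 ordinary values plus the reported maximum of 65.8), not the wine data itself.

```r
# Hypothetical residual-sugar values: 99 ordinary wines plus one extreme value
sugar <- c(seq(1, 20, length.out = 99), 65.8)

cutoff  <- quantile(sugar, probs = 0.99)  # 99th percentile
trimmed <- sugar[sugar <= cutoff]         # keep values at or below the cutoff

print(length(sugar))
print(length(trimmed))
```

The same cutoff could equally be applied as an axis limit on the plot instead of subsetting the data; the choice only matters if the trimmed data are reused later.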

Chlorides has some outliers, so only values up to the 99th percentile are considered. Middle-quality wines span a wide range of chlorides, whereas lower- and higher-quality wines have lower chloride values. This may be because the middle quality levels have more samples, so the variance is quite high.

No relationship could be established between quality and total sulfur dioxide.

Wines of different quality have similar distributions of sulphates.

## 
##    3    4    5    6    7    8    9 
##   20  163 1457 2198  880  175    5

For the box plot analysis, quality is converted to a factor variable.

Wines of quality 6 and above have higher median alcohol, and the median alcohol increases with each quality level from 6 upward. As expected, density shows exactly the opposite trend.

I could not find any definite pattern in the box plots of quality against residual sugar, total sulfur dioxide or sulphates.

Lower-quality wines have higher median chloride content than higher-quality wines.

Bivariate Data Analysis

What relationships between features were observed in the bivariate analysis?

Quality and alcohol are related, with a correlation coefficient of 0.4356: higher-quality wines tend to have higher alcohol content. Wines of quality 7, 8 and 9 have higher median alcohol content than lower-quality wines.

The next feature related to wine quality is density, with a correlation of -0.3071. Density is a physical property determined by the other chemical constituents of the wine, in our case mainly alcohol and residual sugar. I therefore suspect that density does not affect wine quality much on its own, since it is itself driven by the presence of other chemicals.

Chlorides are negatively correlated with wine quality, with a correlation coefficient of -0.2099. Wines of quality 7, 8 and 9 have lower median chloride content than lower-quality wines.

The other features that seem to have an effect on wine quality are fixed acidity, volatile acidity, residual sugar, total sulfur dioxide and sulphates. Further analysis is needed to determine the relationship between these features and wine quality.

Did you find any relationships between features other than the main feature(s) of interest?

Density correlates strongly with residual sugar. The correlation coefficient between density and residual sugar is 0.839.

There is a strong negative correlation between alcohol and density (-0.7801): the higher the percentage of alcohol, the lower the density.

Which features have a strong relationship with the feature of interest?

The strongest relationship with quality, though only a moderate one, is the positive correlation with alcohol (0.4356). Density has a weaker negative correlation (-0.3071), and chlorides is another feature with a negative correlation (-0.2099).

Multivariate Analysis

## 
##   high    low medium 
##   1060   1640   2198

The wines are divided into three categories: low, medium and high. A quality value less than 6 is categorized as low, a quality of 6 as medium, and a quality greater than 6 as high.
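
The grouping rule can be sketched in a few lines of base R. The report does not show how `qualityLabel` was actually built, so the nested `ifelse` below is an assumption; the quality vector is reconstructed from the frequency table printed earlier rather than read from the data file.

```r
# Rebuild the quality column from the counts in the frequency table
quality_counts <- c(`3` = 20, `4` = 163, `5` = 1457, `6` = 2198,
                    `7` = 880, `8` = 175, `9` = 5)
quality <- rep(as.integer(names(quality_counts)), times = quality_counts)

# Below 6 is "low", exactly 6 is "medium", above 6 is "high"
qualityLabel <- ifelse(quality < 6, "low",
                       ifelse(quality == 6, "medium", "high"))

label_counts <- table(qualityLabel)
print(label_counts)
```

The resulting counts match the table above: 1060 high, 1640 low and 2198 medium.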

Low-quality wines mostly have alcohol content of 11 or lower and volatile acidity in the range 0.2 to 0.6. Medium-quality wines have alcohol content predominantly 11 or lower and volatile acidity in the range 0.1 to 0.5. High-quality wines have alcohol content predominantly 11 or higher and volatile acidity in the range 0.1 to 0.5. This behavior is expected, given the positive correlation between alcohol and wine quality and the negative correlation between volatile acidity and wine quality.

Low- and medium-quality wines mostly have fixed acidity from 5 to 8.5 and alcohol content less than 11, whereas high-quality wines mostly have fixed acidity from 5 to 7.5 and alcohol content higher than 11. This is consistent with our correlation analysis.

Low- and medium-quality wines mostly have residual sugar from 2 to 20 and alcohol content less than 11, whereas high-quality wines mostly have residual sugar from 2 to 13 and alcohol content higher than 11. This is consistent with our correlation analysis.

##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :  11   Min.   : 4.200   Min.   :0.1000   Min.   :0.0000  
##  1st Qu.:1135   1st Qu.: 6.400   1st Qu.:0.2400   1st Qu.:0.2400  
##  Median :2238   Median : 6.800   Median :0.2900   Median :0.3200  
##  Mean   :2318   Mean   : 6.962   Mean   :0.3103   Mean   :0.3343  
##  3rd Qu.:3533   3rd Qu.: 7.500   3rd Qu.:0.3500   3rd Qu.:0.4100  
##  Max.   :4895   Max.   :11.800   Max.   :1.1000   Max.   :1.0000  
##                                                                   
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.04000   1st Qu.: 20.00     
##  Median : 6.625   Median :0.04700   Median : 34.00     
##  Mean   : 7.054   Mean   :0.05144   Mean   : 35.34     
##  3rd Qu.:11.025   3rd Qu.:0.05300   3rd Qu.: 49.00     
##  Max.   :23.500   Max.   :0.34600   Max.   :289.00     
##                                                        
##  total.sulfur.dioxide    density             pH         sulphates     
##  Min.   :  9.0        Min.   :0.9872   Min.   :2.79   Min.   :0.2500  
##  1st Qu.:117.0        1st Qu.:0.9932   1st Qu.:3.08   1st Qu.:0.4100  
##  Median :149.0        Median :0.9951   Median :3.16   Median :0.4700  
##  Mean   :148.6        Mean   :0.9952   Mean   :3.17   Mean   :0.4815  
##  3rd Qu.:182.0        3rd Qu.:0.9971   3rd Qu.:3.24   3rd Qu.:0.5300  
##  Max.   :440.0        Max.   :1.0024   Max.   :3.79   Max.   :0.8800  
##                                                                       
##     alcohol         quality      qualityfactor qualityLabel      
##  Min.   : 8.00   Min.   :3.000   3:  20        Length:1640       
##  1st Qu.: 9.20   1st Qu.:5.000   4: 163        Class :character  
##  Median : 9.60   Median :5.000   5:1457        Mode  :character  
##  Mean   : 9.85   Mean   :4.876   6:   0                          
##  3rd Qu.:10.40   3rd Qu.:5.000   7:   0                          
##  Max.   :13.60   Max.   :5.000   8:   0                          
##                                  9:   0

Low- and medium-quality wines mostly have free sulfur dioxide from 10 to 60 and alcohol content less than 11, whereas high-quality wines mostly have free sulfur dioxide from 25 to 50 and alcohol content higher than 11. This explains the weak correlation between wine quality and free sulfur dioxide.

Low- and medium-quality wines mostly have sulphates from 0.3 to 0.6 and alcohol content less than 11, whereas high-quality wines mostly have sulphates from 0.25 to 0.7 and alcohol content higher than 11. This is consistent with our correlation analysis.

For a given alcohol content, chlorides and free sulfur dioxide do not seem to have any effect on quality.

##             alcohol    volatile.acidity             density 
##            5.142229            1.041865           16.008081 
##       fixed.acidity      residual.sugar free.sulfur.dioxide 
##            1.406153            7.233439            1.147541 
##           sulphates 
##            1.125417
##             alcohol    volatile.acidity       fixed.acidity 
##            1.303170            1.029825            1.026391 
##      residual.sugar free.sulfur.dioxide           sulphates 
##            1.346054            1.147219            1.006988

Since density is affected by alcohol and residual sugar, the variance inflation factor (VIF) of every candidate variable has to be checked before the linear regression analysis. Density has a high VIF (16.008); after removing density, the VIFs of the remaining variables are acceptable (all below 1.35).

Thus we will use alcohol, volatile acidity, fixed acidity, residual sugar, free sulfur dioxide and sulphates to create a linear model of quality.
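
The report does not show the call that produced the VIFs. As a cross-check that needs no extra package, the VIF of each predictor equals the corresponding diagonal element of the inverse of the predictors' correlation matrix. The sketch below applies that identity to the rounded Pearson correlations printed in the bivariate section, so it should agree with the printed VIFs only approximately. The helper names `set_cor` and `vif_from_cor` are assumptions of this example.

```r
vars <- c("alcohol", "volatile.acidity", "density", "fixed.acidity",
          "residual.sugar", "free.sulfur.dioxide", "sulphates")
R <- diag(length(vars))
dimnames(R) <- list(vars, vars)

# Helper (hypothetical name): store one correlation symmetrically
set_cor <- function(m, a, b, r) {
  m[a, b] <- r
  m[b, a] <- r
  m
}

# Rounded correlations copied from the matrix in the bivariate section
R <- set_cor(R, "alcohol", "volatile.acidity", 0.06772)
R <- set_cor(R, "alcohol", "density", -0.7801)
R <- set_cor(R, "alcohol", "fixed.acidity", -0.1209)
R <- set_cor(R, "alcohol", "residual.sugar", -0.4506)
R <- set_cor(R, "alcohol", "free.sulfur.dioxide", -0.2501)
R <- set_cor(R, "alcohol", "sulphates", -0.01743)
R <- set_cor(R, "volatile.acidity", "density", 0.02711)
R <- set_cor(R, "volatile.acidity", "fixed.acidity", -0.0227)
R <- set_cor(R, "volatile.acidity", "residual.sugar", 0.06429)
R <- set_cor(R, "volatile.acidity", "free.sulfur.dioxide", -0.09701)
R <- set_cor(R, "volatile.acidity", "sulphates", -0.03573)
R <- set_cor(R, "density", "fixed.acidity", 0.2653)
R <- set_cor(R, "density", "residual.sugar", 0.839)
R <- set_cor(R, "density", "free.sulfur.dioxide", 0.2942)
R <- set_cor(R, "density", "sulphates", 0.07449)
R <- set_cor(R, "fixed.acidity", "residual.sugar", 0.08902)
R <- set_cor(R, "fixed.acidity", "free.sulfur.dioxide", -0.0494)
R <- set_cor(R, "fixed.acidity", "sulphates", -0.01714)
R <- set_cor(R, "residual.sugar", "free.sulfur.dioxide", 0.2991)
R <- set_cor(R, "residual.sugar", "sulphates", -0.02666)
R <- set_cor(R, "free.sulfur.dioxide", "sulphates", 0.05922)

# VIF of each predictor = diagonal of the inverse correlation matrix
vif_from_cor <- function(m) setNames(diag(solve(m)), rownames(m))

vif_full <- vif_from_cor(R)

# Drop density and recompute for the remaining predictors
keep <- setdiff(vars, "density")
vif_reduced <- vif_from_cor(R[keep, keep])

print(round(vif_full, 2))
print(round(vif_reduced, 2))
```

Working from the correlation matrix also makes the source of the problem visible: density is almost fully determined by alcohol and residual sugar, which is what inflates its VIF.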

##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"              "qualityfactor"        "qualityLabel"
## 
## Calls:
## m1: lm(formula = quality ~ alcohol, data = wine)
## m2: lm(formula = quality ~ alcohol + volatile.acidity, data = wine)
## m3: lm(formula = quality ~ alcohol + volatile.acidity + fixed.acidity, 
##     data = wine)
## m4: lm(formula = quality ~ alcohol + volatile.acidity + fixed.acidity + 
##     residual.sugar, data = wine)
## m5: lm(formula = quality ~ alcohol + volatile.acidity + fixed.acidity + 
##     residual.sugar + free.sulfur.dioxide, data = wine)
## m6: lm(formula = quality ~ alcohol + volatile.acidity + fixed.acidity + 
##     residual.sugar + free.sulfur.dioxide + pH, data = wine)
## m7: lm(formula = quality ~ alcohol + volatile.acidity + fixed.acidity + 
##     residual.sugar + free.sulfur.dioxide + sulphates, data = wine)
## 
## ====================================================================================================
##                           m1         m2         m3         m4         m5         m6         m7      
## ----------------------------------------------------------------------------------------------------
##   (Intercept)           2.582***   3.017***   3.548***   2.919***   2.663***   1.901***   2.446***  
##                        (0.098)    (0.098)    (0.141)    (0.150)    (0.157)    (0.338)    (0.164)    
##   alcohol               0.313***   0.324***   0.319***   0.370***   0.377***   0.377***   0.378***  
##                        (0.009)    (0.009)    (0.009)    (0.010)    (0.010)    (0.010)    (0.010)    
##   volatile.acidity                -1.979***  -1.988***  -2.119***  -2.052***  -2.043***  -2.040***  
##                                   (0.110)    (0.109)    (0.109)    (0.109)    (0.109)    (0.109)    
##   fixed.acidity                              -0.068***  -0.074***  -0.068***  -0.052***  -0.067***  
##                                              (0.013)    (0.013)    (0.013)    (0.014)    (0.013)    
##   residual.sugar                                         0.027***   0.024***   0.025***   0.025***  
##                                                         (0.002)    (0.002)    (0.003)    (0.002)    
##   free.sulfur.dioxide                                               0.004***   0.004***   0.004***  
##                                                                    (0.001)    (0.001)    (0.001)    
##   pH                                                                           0.205*               
##                                                                               (0.081)               
##   sulphates                                                                               0.412***  
##                                                                                          (0.095)    
## ----------------------------------------------------------------------------------------------------
##   R-squared                 0.2        0.2        0.2        0.3        0.3        0.3        0.3   
##   adj. R-squared            0.2        0.2        0.2        0.3        0.3        0.3        0.3   
##   sigma                     0.8        0.8        0.8        0.8        0.8        0.8        0.8   
##   F                      1146.4      773.9      527.7      437.6      358.3      300.0      302.8   
##   p                         0.0        0.0        0.0        0.0        0.0        0.0        0.0   
##   Log-likelihood        -5839.4    -5681.8    -5668.2    -5605.7    -5590.5    -5587.2    -5581.1   
##   Deviance               3112.3     2918.3     2902.2     2829.0     2811.5     2807.8     2800.7   
##   AIC                   11684.8    11371.6    11346.4    11223.4    11195.0    11190.5    11178.2   
##   BIC                   11704.3    11397.5    11378.9    11262.4    11240.5    11242.5    11230.2   
##   N                      4898       4898       4898       4898       4898       4898       4898     
## ====================================================================================================
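The AIC and BIC rows of the table can be compared directly. The sketch below copies those two rows into R vectors and ranks the models; the labels `m1` to `m7` are assumed names for the seven columns, read left to right, and are not taken from the original output.

```r
# AIC and BIC values copied from the model comparison table.
# The names m1..m7 are hypothetical labels for the seven columns, left to right.
aic <- c(m1 = 11684.8, m2 = 11371.6, m3 = 11346.4, m4 = 11223.4,
         m5 = 11195.0, m6 = 11190.5, m7 = 11178.2)
bic <- c(m1 = 11704.3, m2 = 11397.5, m3 = 11378.9, m4 = 11262.4,
         m5 = 11240.5, m6 = 11242.5, m7 = 11230.2)

# Lower is better for both criteria.
best.aic <- names(which.min(aic))
best.bic <- names(which.min(bic))

# Difference of each model from the best one, per criterion.
delta <- data.frame(delta.aic = round(aic - min(aic), 1),
                    delta.bic = round(bic - min(bic), 1))
print(delta)
```

Both criteria favour the last column, which adds sulphates. Adding pH lowers AIC slightly relative to the fifth column but raises BIC, so the evidence for pH is weaker than for the other terms.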

What new features were introduced for the multivariate analysis?

Wine quality is grouped into three categories: low, medium and high. A quality score below 6 is categorized as low, a score of exactly 6 as medium, and a score above 6 as high.
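One way to build such a variable is with `cut()`. This is a minimal sketch on made-up scores, not the code used for the report; the name `quality.level` is an assumption.

```r
# Hypothetical quality scores standing in for the quality column of the data.
quality <- c(3, 4, 5, 6, 6, 7, 8, 9)

# With the default right = TRUE the intervals are (-Inf, 5], (5, 6], (6, Inf],
# so integer scores below 6 are "low", exactly 6 is "medium", above 6 is "high".
quality.level <- cut(quality,
                     breaks = c(-Inf, 5, 6, Inf),
                     labels = c("low", "medium", "high"),
                     ordered_result = TRUE)

print(table(quality.level))
```

Making the factor ordered keeps the levels in low, medium, high order when it is used for plot facets or colour legends.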

What observations were made in your analysis? Did the features strengthen each other in terms of the feature of interest?

Alcohol content, together with volatile acidity, fixed acidity, residual sugar, free sulfur dioxide, pH and sulphates, seems to have an effect on wine quality.

High alcohol content combined with lower fixed acidity, volatile acidity and residual sugar is associated with high wine quality. Likewise, high alcohol content combined with high free sulfur dioxide, pH and sulphates is associated with high wine quality.

Were there any surprising findings in your analysis?

I was expecting alcohol combined with chlorides, and alcohol combined with free sulfur dioxide, to show some relationship with wine quality. I was surprised to find no such relationship.

Did you create any models with your dataset? Discuss the details of the model.

Yes, I created a linear model using alcohol, volatile acidity, fixed acidity, residual sugar, free sulfur dioxide, pH and sulphates.

The model accounts for about 30% of the variance in wine quality. Density was not added to the model because of its high variance inflation factor. The share of variance explained is not very high, so predictions of wine quality from this model may not be reliable.
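The variance inflation factor of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on the other predictors. The sketch below shows the check on synthetic data in base R. The column names follow the report, but every value is simulated, the helper `vif_one` is hypothetical, and density is deliberately built from alcohol and residual sugar to imitate the collinearity.

```r
set.seed(42)
n <- 500

# Synthetic stand-in data: names follow the report, values are simulated.
wine <- data.frame(alcohol          = runif(n, 8, 14),
                   residual.sugar   = runif(n, 0.6, 20),
                   volatile.acidity = runif(n, 0.08, 1.1))
# Assumption for illustration: density depends strongly on alcohol and sugar.
wine$density <- 1 - 0.002 * wine$alcohol + 0.0004 * wine$residual.sugar +
  rnorm(n, sd = 0.0003)
wine$quality <- 3 + 0.35 * wine$alcohol - 2 * wine$volatile.acidity +
  0.02 * wine$residual.sugar + rnorm(n, sd = 0.8)

# VIF for one predictor: regress it on the remaining predictors,
# then apply 1 / (1 - R^2).
vif_one <- function(target, predictors, data) {
  others <- setdiff(predictors, target)
  r2 <- summary(lm(reformulate(others, response = target), data = data))$r.squared
  1 / (1 - r2)
}

predictors <- c("alcohol", "volatile.acidity", "residual.sugar", "density")
vifs <- sapply(predictors, vif_one, predictors = predictors, data = wine)
print(round(vifs, 1))

# Refit without density and read off the share of variance explained.
fit <- lm(quality ~ alcohol + volatile.acidity + residual.sugar, data = wine)
r.squared <- summary(fit)$r.squared
print(round(r.squared, 2))
```

A common rule of thumb treats a VIF well above 5 or 10 as a sign that the predictor is largely explained by the others, which is the reasoning behind leaving density out.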

Final Plots and Summary

Plot One

Description One

Most of the wines in this dataset have a quality score of 5, 6 or 7. There are no wines with a quality score below 3 or with a score of 10.

Plot Two

Description Two

From the scatter plot of alcohol against quality we can see that wines of quality 5, 6 and 7 span alcohol content from 8 percent to 13 percent. This is likely because these quality levels have many more samples than the others.

Lower quality wines, rated 5 and below, have alcohol content predominantly in the range of 8 to 11 percent. Wines rated 6 and above have a higher median alcohol value, and the median increases with each quality level from 6 upward.
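The median alcohol value per quality level can be computed with `tapply()`. The data frame `toy` below is a made-up example with a handful of rows, used only to show the call; it is not a sample from the wine data.

```r
# Made-up rows for illustration only.
toy <- data.frame(
  quality = c(5, 5, 5, 6, 6, 6, 7, 7, 7),
  alcohol = c(9.0, 9.5, 10.0, 9.8, 10.5, 11.0, 10.8, 11.4, 12.5)
)

# One median per quality level; the result is named by quality score.
median.alcohol <- tapply(toy$alcohol, toy$quality, median)
print(median.alcohol)

# TRUE when the median rises from each quality level to the next.
increasing <- all(diff(as.numeric(median.alcohol)) > 0)
print(increasing)
```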

Plot Three

Description Three

For the above plots we only consider alcohol values from 8 to 14 and residual sugar values from 0 to 20, as most of the wine samples fall in this range. Wines of quality 5, 6 and 7 dominate the plot.

Low and medium quality wines mostly have residual sugar values from 2 to 20 and alcohol content below 11. High quality wines mostly have residual sugar values from 2 to 13 and alcohol content above 11. This is consistent with our correlation analysis.

Reflection

There are 4898 wine samples in the dataset. I started by exploring the dataset one variable at a time. Later I formulated some questions and explored some interesting features in the dataset. Finally I explored the relationship between wine quality and the other chemical features in the dataset.

Wine quality has a positive correlation with alcohol, free sulfur dioxide, pH and sulphates, and a negative correlation with density, chlorides, fixed acidity, volatile acidity and residual sugar. Further analysis showed that free sulfur dioxide and chlorides have relatively little influence on wine quality. I then created a linear model using alcohol, volatile acidity, fixed acidity, residual sugar, free sulfur dioxide, pH and sulphates. The model accounts for about 30% of the variance in wine quality. Density was not added to the model because of its high variance inflation factor. The share of variance explained is not very high, so predictions of wine quality from these features may not be reliable.

The main drawback of the dataset is that the sample count for some quality levels is quite low. There are no wines with quality below 3 or with quality 10. Furthermore, quality levels 3, 4, 8 and 9 have only 20, 163, 175 and 5 samples respectively, whereas quality levels 5, 6 and 7 have 1457, 2198 and 880 samples respectively. A dataset with a more even distribution of samples across quality levels would make the analysis of wine quality more reliable and a predictive model more accurate.
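The size of this imbalance can be put into one number. The short calculation below uses only the total sample count and the counts for quality 5, 6 and 7 quoted in this report; the variable names are assumptions.

```r
# Counts quoted in the report.
total <- 4898
mid.counts <- c("5" = 1457, "6" = 2198, "7" = 880)

# Share of all samples that fall in the three middle quality levels.
mid.share <- sum(mid.counts) / total
print(round(100 * mid.share, 1))

# Samples left over for every other quality level combined.
remaining <- total - sum(mid.counts)
print(remaining)
```

Roughly nine in ten samples sit in the three middle quality levels, so any model fitted to this data sees very few examples of the lowest and highest quality wines.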

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot. Similarly message = FALSE parameter was added to